CSC 575 Intelligent Information Retrieval

Homework #2

Siravich Khongrod (In-class)

http://condor.depaul.edu/ntomuro/courses/575/assign/HW2.html

There are 3 Parts

1. Word Frequency

The TED dataset "ted_main.csv" contains information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017, including the number of views, number of comments, short descriptions, speaker names, and titles.

Write code to obtain the following information about the words and tokens that appear in the descriptions of all talks (the 'description' column in the dataset) after processing the text in the specified ways. Essentially, your task is to fill in the following table:

In [1]:
import pandas as pd
ted_raw = pd.read_csv('ted_main.csv', encoding = 'utf8')
print(ted_raw.columns)
ted_raw.head()
Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views'],
      dtype='object')
Out[1]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views
0 4553 Sir Ken Robinson makes an entertaining and pro... 1164 TED2006 1140825600 60 Ken Robinson Ken Robinson: Do schools kill creativity? 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... Author/educator ['children', 'creativity', 'culture', 'dance',... Do schools kill creativity? https://www.ted.com/talks/ken_robinson_says_sc... 47227110
1 265 With the same humor and humanity he exuded in ... 977 TED2006 1140825600 43 Al Gore Al Gore: Averting the climate crisis 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... Climate advocate ['alternative energy', 'cars', 'climate change... Averting the climate crisis https://www.ted.com/talks/al_gore_on_averting_... 3200520
2 124 New York Times columnist David Pogue takes aim... 1286 TED2006 1140739200 26 David Pogue David Pogue: Simplicity sells 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... Technology columnist ['computers', 'entertainment', 'interface desi... Simplicity sells https://www.ted.com/talks/david_pogue_says_sim... 1636292
3 200 In an emotionally charged talk, MacArthur-winn... 1116 TED2006 1140912000 35 Majora Carter Majora Carter: Greening the ghetto 1 1151367060 [{'id': 3, 'name': 'Courageous', 'count': 760}... [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... Activist for environmental justice ['MacArthur grant', 'activism', 'business', 'c... Greening the ghetto https://www.ted.com/talks/majora_carter_s_tale... 1697550
4 593 You've never seen data presented like this. Wi... 1190 TED2006 1140566400 48 Hans Rosling Hans Rosling: The best stats you've ever seen 1 1151440680 [{'id': 9, 'name': 'Ingenious', 'count': 3202}... [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... Global health expert; data visionary ['Africa', 'Asia', 'Google', 'demo', 'economic... The best stats you've ever seen https://www.ted.com/talks/hans_rosling_shows_t... 12005869
In [2]:
import nltk
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
ted_raw['description'].head()
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\skhongro\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\skhongro\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[2]:
0    Sir Ken Robinson makes an entertaining and pro...
1    With the same humor and humanity he exuded in ...
2    New York Times columnist David Pogue takes aim...
3    In an emotionally charged talk, MacArthur-winn...
4    You've never seen data presented like this. Wi...
Name: description, dtype: object
In [3]:
nobs = len(ted_raw['description'])
docs_token = [None] * nobs
docs_token_raw = [None] * nobs
docs_token_stem = [None] * nobs
porter = nltk.PorterStemmer()

for i in range(0,nobs):
    docs_token_raw[i] = word_tokenize(ted_raw['description'][i])
    # filter English stopwords + lowercase
    # (the first attempt below compared the raw token, so capitalized stopwords
    #  such as 'The' or 'In' slipped through; the fixed version lowercases first)
#     docs_token[i] = [w.lower() for w in docs_token_raw[i] if w not in stopwords.words('english')] 
    docs_token[i] = [w.lower() for w in docs_token_raw[i] if w.lower() not in stopwords.words('english')]
    # filter tokens that contain non-alphabetic character(s)
    docs_token[i] = [w for w in docs_token[i] if w.isalpha()] 
    docs_token_stem[i] = [porter.stem(tok) for tok in docs_token[i]] # apply stemmer

print(docs_token_raw[0])
print(docs_token[0])
print(docs_token_stem[0])
['Sir', 'Ken', 'Robinson', 'makes', 'an', 'entertaining', 'and', 'profoundly', 'moving', 'case', 'for', 'creating', 'an', 'education', 'system', 'that', 'nurtures', '(', 'rather', 'than', 'undermines', ')', 'creativity', '.']
['sir', 'ken', 'robinson', 'makes', 'entertaining', 'profoundly', 'moving', 'case', 'creating', 'education', 'system', 'nurtures', 'rather', 'undermines', 'creativity']
['sir', 'ken', 'robinson', 'make', 'entertain', 'profoundli', 'move', 'case', 'creat', 'educ', 'system', 'nurtur', 'rather', 'undermin', 'creativ']

                                [A]                  [B]                        [C]
                                Word tokenization    Word tokenization          Word tokenization
                                (only)               + Case folding             + Case folding
                                                       (lower-case)               (lower-case)
                                                     + Stopword filtering       + Stopword filtering
                                                     + Non-alphabet filtering   + Non-alphabet filtering
                                                                                + Porter stemming
(1) Total # of tokens
(2) Size of vocabulary
(3) Top 20 most common token types with frequency
    (list in descending order of frequency)
(4) Percentage of tokens in the dataset that is
    covered by the top 20 token types

NOTES:

  • IMPORTANT: When you open the dataset file, you must give an optional parameter encoding='utf-8' (because the file contains some characters which are not ascii).
  • Do not include the header row in the dataset.
  • Relevant functionality in NLTK to use: functions sent_tokenize() and word_tokenize() in package nltk.tokenize; class PorterStemmer in package nltk.stem.porter; the stopwords corpus in package nltk.corpus.
  • For 'Non-alphabet filtering', use the function 'isalpha()' in Python.
In [5]:
def token_counter(docs_token_raw):
    print('[1] '+str(len([token for doc_t in docs_token_raw for token in doc_t]))) # -> sum(fdist.values())
    # Count frequencies of the vocabulary terms
    fdist = nltk.FreqDist([token for doc_t in docs_token_raw for token in doc_t])
    # fdist is essentially a Python dictionary
    tfpairs = fdist.items()
    print('[2] '+str(len(tfpairs))) # number of unique tokens
    print('[3] '+str(fdist.most_common(20)))
    print('[4] '+str(round(sum([item[1] for item in fdist.most_common(20)])/sum(fdist.values())*100,2))+'%')
    print()
    
token_counter(docs_token_raw)
token_counter(docs_token)
token_counter(docs_token_stem)
[1] 151994
[2] 17877
[3] [(',', 7382), ('.', 5764), ('the', 5395), ('and', 4264), ('of', 3651), ('to', 3528), ('a', 3505), ('in', 1762), ('--', 1485), ('that', 1472), ("'s", 1217), ('for', 1140), ('``', 898), ("''", 893), ('with', 879), ('we', 878), ('is', 834), ('it', 833), ('?', 824), ('this', 812)]
[4] 31.2%

[1] 78428
[2] 14816
[3] [('in', 762), ('talk', 700), ('us', 643), ('world', 515), ('new', 415), ('says', 411), ('people', 332), ('shares', 326), ('the', 306), ('shows', 282), ('life', 274), ('one', 272), ('ted', 254), ('like', 251), ('make', 239), ('way', 227), ('he', 224), ('human', 205), ('but', 205), ('work', 203)]
[4] 8.98%

[1] 78428
[2] 10676
[3] [('talk', 880), ('in', 762), ('us', 643), ('world', 527), ('say', 453), ('make', 449), ('share', 444), ('new', 415), ('show', 371), ('use', 360), ('work', 356), ('peopl', 334), ('human', 330), ('way', 326), ('one', 307), ('stori', 307), ('the', 306), ('live', 282), ('help', 281), ('life', 274)]
[4] 10.72%

Based on the results you obtained, answer the following questions. Answer in detail.

Did the size of vocabulary (2) decrease significantly from [A] to [B]? Why do you think it did/didn't?

Yes, the size decreased significantly (17,877 → 14,816). The stopword and non-alphabet filters eliminate many token types outright (punctuation marks, numbers, and common function words), and case folding merges types that differ only in capitalization, e.g. 'The' and 'the' become a single type.

Did the size of vocabulary (2) decrease significantly from [B] to [C]? Why do you think it did/didn't?

Yes, the size decreased significantly (14,816 → 10,676) because vocabulary terms are stemmed and merged into common roots. For instance, 'entertainer', 'entertaining', and 'entertainment' are all reduced to the single stem 'entertain', collapsing three types into one.

How did the percentage of the top 20 token types (4) change from [B] to [C]? What do you think influenced the change?

The percentage increased from 8.98% to 10.72%. Stemming reduces the number of distinct token types while the total token count stays the same (78,428), so the frequency mass that was spread across several inflected forms is concentrated onto a single stem, and the top 20 types cover a larger share of the tokens. For instance, 'talk' in [B] counts only exact occurrences of 'talk' (700), whereas the stem 'talk' in [C] also absorbs inflected forms such as 'talks' and 'talked', raising its count to 880.
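This conflation is easy to check directly with NLTK's Porter stemmer (a small illustration only; the word list here is not from the dataset, and PorterStemmer needs no downloaded corpus):

```python
from nltk.stem.porter import PorterStemmer

# several inflected surface forms collapse to one vocabulary entry
porter = PorterStemmer()
variants = ['talk', 'talks', 'talked', 'talking']
stems = {porter.stem(w) for w in variants}
print(stems)  # all four forms reduce to the single stem 'talk'
```

Each stemmed form lands on the same entry, which is exactly why the per-stem frequencies in [C] are higher than the per-word frequencies in [B].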

                              [A]        [B]        [C]
(1) # of tokens               151994     78428      78428
(2) Size of vocab             17877      14816      10676
(4) % top 20 tokens in ds     31.20%     8.98%      10.72%

(3) Top 20 common token types:
[A] [(',', 7382), ('.', 5764), ('the', 5395), ('and', 4264), ('of', 3651), ('to', 3528), ('a', 3505), ('in', 1762), ('--', 1485), ('that', 1472), ("'s", 1217), ('for', 1140), ('``', 898), ("''", 893), ('with', 879), ('we', 878), ('is', 834), ('it', 833), ('?', 824), ('this', 812)]
[B] [('in', 762), ('talk', 700), ('us', 643), ('world', 515), ('new', 415), ('says', 411), ('people', 332), ('shares', 326), ('the', 306), ('shows', 282), ('life', 274), ('one', 272), ('ted', 254), ('like', 251), ('make', 239), ('way', 227), ('he', 224), ('human', 205), ('but', 205), ('work', 203)]
[C] [('talk', 880), ('in', 762), ('us', 643), ('world', 527), ('say', 453), ('make', 449), ('share', 444), ('new', 415), ('show', 371), ('use', 360), ('work', 356), ('peopl', 334), ('human', 330), ('way', 326), ('one', 307), ('stori', 307), ('the', 306), ('live', 282), ('help', 281), ('life', 274)]

2. Word cloud

Using the tags associated with talks in the TED dataset, create a word cloud for tags 'climate change' and 'AI'.

There are many tools and reference sites available that help you create word clouds (such as this, this and a search result). Any will do. You pick one and figure out how to use it.

Make one cloud for each tag. Copy/paste the generated clouds in your submission file.

NOTE:

  • Entries in the tags column are strings (e.g. "['children', 'creativity', 'culture' ]"), NOT lists. So you have to first convert the string representation of a list (of strings) into a real list. This site shows you how.
  • Each talk is tagged with several tags. Take all talks (their descriptions) whose tags indicated in the dataset include 'climate change' or 'AI'.
  • Text/string processing (e.g. case-folding, stemming) is not necessary for this problem, but if you like to do some, you are more than welcome to do so.
  • For your interest, here is my simple word cloud for 'AI'.

In [4]:
from wordcloud import WordCloud
ted_raw['tags'].head()
Out[4]:
0    ['children', 'creativity', 'culture', 'dance',...
1    ['alternative energy', 'cars', 'climate change...
2    ['computers', 'entertainment', 'interface desi...
3    ['MacArthur grant', 'activism', 'business', 'c...
4    ['Africa', 'Asia', 'Google', 'demo', 'economic...
Name: tags, dtype: object
In [5]:
# convert string representation of list to list in Python
# https://www.tutorialspoint.com/How-to-convert-string-representation-of-list-to-list-in-Python
import ast
str(ted_raw['tags'].head()[1])
ast.literal_eval(str(ted_raw['tags'].head()[1]))
# for items in tags:
#     print(ast.literal_eval(items))
    
# print(([tag for items in tags for tag in ast.literal_eval(items)]))
Out[5]:
['alternative energy',
 'cars',
 'climate change',
 'culture',
 'environment',
 'global issues',
 'science',
 'sustainability',
 'technology']
In [15]:
import ast
tags = ted_raw['tags']
# fdist = nltk.FreqDist([tag for items in tags for tag in ast.literal_eval(items)])
doc_tags = [ast.literal_eval(items) for items in tags]

# collect the row indices of talks tagged 'AI' or 'climate change'
docs_ai, docs_cli = [], []
for i, tags in enumerate(doc_tags):
    if 'AI' in tags:
        docs_ai.append(i)
    if 'climate change' in tags:
        docs_cli.append(i)
In [18]:
# fdist = nltk.FreqDist([tag for tags in tags_cli for tag in tags if tag not in 'climate change'])
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# set the figure size/dpi in a single plt.figure call; calling plt.figure(1)
# and then plt.figure(figsize=...) would open an extra empty figure each time
wordcloud = WordCloud().generate(' '.join([token for i in docs_cli for token in docs_token[i]]))
plt.figure(figsize=(16, 8), dpi=300)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

wordcloud = WordCloud().generate(' '.join([token for i in docs_ai for token in docs_token[i]]))
plt.figure(figsize=(16, 8), dpi=300)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

3. Inverted Index

Create an inverted index for the TED dataset, using the text processing scheme [C] in the first problem (i.e., with Porter stemming) and the url of the talk as the document name. The index will consist of the following files:

NOTES: Since the inverted index uses the IDs of the terms and documents, you will probably need to do the task in a few steps, storing intermediate results in data structures or temporary files. Any way is fine, as long as you accomplish the task. IMPORTANT: When you write "term_index.csv" and "inverted_index.csv", you must open the file (for writing) with the optional parameter encoding='utf-8'.
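The staging described above can be sketched with a toy example. Everything here is illustrative: with the real data, `docs` would be `docs_token_stem` and `urls` would be `ted_raw['url']`, and names like `term2id` are my own, not required by the assignment.

```python
# toy stand-ins for docs_token_stem and ted_raw['url']
docs = [['talk', 'world'], ['talk', 'us', 'us']]
urls = ['https://www.ted.com/talks/a', 'https://www.ted.com/talks/b']

# term IDs: positive integers assigned in sorted vocabulary order
vocab = sorted({t for doc in docs for t in doc})
term2id = {t: i + 1 for i, t in enumerate(vocab)}

# document IDs: 1-based position in the dataset, mapped to the talk's url
doc2url = {i + 1: u for i, u in enumerate(urls)}

print(term2id)  # {'talk': 1, 'us': 2, 'world': 3}
print(doc2url)
```

With these two mappings in hand, the three output files are just different traversals of the same structures.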

1. A file that maps each term name to: the term ID and the document frequency. Assign a series of positive integers to the term names. Store each term name per line. Name the file "TED_term_index.csv". The first couple of lines in the file should look like this:

a, 1, 2
aakash, 2, 1
aala, 3, 1
aamodt, 4, 1
aaron, 5, 4
In [145]:
f = open("TED_term_index.csv","w",encoding='UTF-8')
fdist = nltk.FreqDist([token for doc_t in docs_token_stem for token in doc_t])
tfpairs = fdist.items()

i = 1
for tf in sorted(tfpairs):
    if (i<10): print(tf[0] + ", " + str(i) + ", " + str(tf[1]))
    f.write(tf[0] + ", " + str(i) + ", " + str(tf[1])+'\n')
    i+=1
    
f.close()
a, 1, 2
aakash, 2, 1
aala, 3, 1
aamodt, 4, 1
aaron, 5, 4
aaronson, 6, 1
ababa, 7, 1
abalon, 8, 1
abandon, 9, 11

2. A file that maps each document ID to: the document name. Assign a series of positive integers to the documents. Store each document ID per line. Name the file "TED_doc_index.csv". The first couple of lines in the file should look like this:

1, https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity
2, https://www.ted.com/talks/al_gore_on_averting_climate_crisis
3, https://www.ted.com/talks/david_pogue_says_simplicity_sells
In [54]:
f=open("TED_doc_index.csv","w",encoding='utf-8')
i=1
for url in ted_raw['url']:
    # the url strings in this dataset already end with a newline,
    # hence end="" in print and no '\n' appended in write
    if(i<10): print(str(i)+", "+url,end ="")
    f.write(str(i)+", "+url)
    i+=1
f.close()
1, https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity
2, https://www.ted.com/talks/al_gore_on_averting_climate_crisis
3, https://www.ted.com/talks/david_pogue_says_simplicity_sells
4, https://www.ted.com/talks/majora_carter_s_tale_of_urban_renewal
5, https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
6, https://www.ted.com/talks/tony_robbins_asks_why_we_do_what_we_do
7, https://www.ted.com/talks/julia_sweeney_on_letting_go_of_god
8, https://www.ted.com/talks/joshua_prince_ramus_on_seattle_s_library
9, https://www.ted.com/talks/dan_dennett_s_response_to_rick_warren

3. The inverted index file that maps each term ID to its postings list. Each posting should contain a document ID and the term frequency in that document. Look at the lecture slides ("Implementation.pptx"), slide 5 for the scheme. Store one term ID per line. Name the file "TED_inverted_index.csv". The first couple of lines in the file should look like this:

1, 1146, 1, 2429, 1
2, 1878, 1
3, 2381, 1
4, 1655, 1
5, 810, 1, 943, 1, 951, 1, 1717, 1
In [147]:
from collections import Counter
word_count_dict = {}

# Build the postings by appending each docId to the term's list;
# duplicates represent repeated occurrences within the same document
for i in range(0,len(docs_token_stem)):
    for word in docs_token_stem[i]:
        word_count_dict.setdefault(word,[]).append(i+1)

# Calculate freq within docs
# [doc_ids for doc_ids in word_count_dict['grand']]
f = open("TED_inverted_index.csv","w",encoding='utf-8')
term_id = 1
for key in sorted(word_count_dict.keys()):
    c = Counter(word_count_dict[key])  # term frequency per document
    # flatten the (doc_id, tf) pairs into: term_id, doc1, tf1, doc2, tf2, ...
    fout = str(term_id)+", "+', '.join(str(n) for pair in c.items() for n in pair)
    if(term_id<=10):
        print(key)
        print(fout)
    f.write(fout+'\n')
    term_id+=1

f.close()
a
1, 1146, 1, 2429, 1
aakash
2, 1878, 1
aala
3, 2381, 1
aamodt
4, 1655, 1
aaron
5, 810, 1, 943, 1, 951, 1, 1717, 1
aaronson
6, 1991, 1
ababa
7, 1611, 1
abalon
8, 923, 1
abandon
9, 973, 1, 1386, 1, 1451, 1, 1604, 1, 1818, 1, 1945, 1, 2129, 1, 2139, 1, 2267, 1, 2337, 1, 2409, 1
abani
10, 133, 1, 266, 1
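As a sanity check, the line format written above can be parsed back into postings lists. This is a minimal sketch using the sample lines from the spec; reading the real file would just replace the StringIO with open('TED_inverted_index.csv', encoding='utf-8').

```python
from io import StringIO

# two sample lines in the "term_id, doc1, tf1, doc2, tf2, ..." format
sample = StringIO("1, 1146, 1, 2429, 1\n2, 1878, 1\n")

postings = {}
for line in sample:
    nums = [int(x) for x in line.split(',')]  # int() tolerates the spaces/newline
    term_id, rest = nums[0], nums[1:]
    # rest alternates doc_id, term_frequency
    postings[term_id] = list(zip(rest[0::2], rest[1::2]))

print(postings)  # {1: [(1146, 1), (2429, 1)], 2: [(1878, 1)]}
```

Round-tripping the file like this confirms that each term ID maps to the expected (document ID, term frequency) pairs.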

Overall Comments

This was my first time using NLTK as opposed to scikit-learn's tokenizers. I found that NLTK's classes, which inherit their behavior from built-in structures such as dictionaries and tuples, can make it ambiguous which functions to call. By reading the documentation of those classes, I was able to complete this assignment. I certainly agree that knowledge of data structures is essential when working with the NLTK package, especially for people who started programming in Python. In my case, a background in other languages such as C and Java really helped me navigate the documentation.

Deliverables

Submit the following: (1) Your answer file; (2) Three output files for problem 3; and (3) Your source code file(s).

(1) Your answer file must:

  • be ONE PDF file.
  • have your name, the course name (CSC 575) and section number, and the assignment number (HW#2). If this information is missing, your submission will be returned UNGRADED.
  • include your overall comments, for instance, how difficult you felt this assignment was, and any particular difficulties you encountered. Write in well-written English prose.

(2) Three output files must be comma separated.

(3) Your source code file must have your name, the course name (CSC 575) and section number, and the assignment number (HW#2) at the top of the file (in the comment section). If you used Jupyter Notebook, submit the html version of the code in addition to the ipynb file.